Optimizing SMT Processors for High Single-Thread Performance

Authors

  • Gautham Thambidorai
  • Donald Yeung
  • Seungryul Choi
Abstract

Simultaneous Multithreading (SMT) processors achieve high processor throughput at the expense of single-thread performance. This paper investigates resource allocation policies for SMT processors that preserve, as much as possible, the single-thread performance of designated “foreground” threads, while still permitting other “background” threads to share resources. Since background threads on such an SMT machine have a near-zero performance impact on foreground threads, we refer to the background threads as transparent threads. Transparent threads are ideal for performing low-priority or non-critical computations, with applications in process scheduling, subordinate multithreading, and on-line performance monitoring. To realize transparent threads, we propose three mechanisms for maintaining the transparency of background threads: slot prioritization, background thread instruction-window partitioning, and background thread flushing. In addition, we propose three mechanisms to boost background thread performance without sacrificing transparency: aggressive fetch partitioning, foreground thread instruction-window partitioning, and foreground thread flushing. We implement our mechanisms on a detailed simulator of an SMT processor, and evaluate them using 8 benchmarks, including 7 from the SPEC CPU2000 suite. Our results show that when cache and branch predictor interference are factored out, background threads introduce less than 1% performance degradation on the foreground thread. Furthermore, maintaining the transparency of background threads reduces their throughput by only 23% relative to an equal-priority scheme. To demonstrate the usefulness of transparent threads, we study Transparent Software Prefetching (TSP), an implementation of software data prefetching using transparent threads. Due to its near-zero overhead, TSP enables prefetch instrumentation for all loads in a program, eliminating the need for profiling. TSP, without any profile information, achieves a 9.52% gain across 6 SPEC benchmarks, whereas conventional software prefetching guided by cache-miss profiles increases performance by only 2.47%.
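
The abstract names slot prioritization but does not define it. As a rough illustration only, the sketch below assumes it means that the foreground thread gets first claim on a fixed number of per-cycle slots and the background thread receives whatever is left. The function name `allocate_slots` and its parameters are hypothetical and are not taken from the paper or its simulator.

```python
def allocate_slots(total_slots, foreground_demand, background_demand):
    """Split per-cycle slots between one foreground and one background thread.

    Assumption (not from the paper): the foreground thread is served first,
    and the background thread may only use slots the foreground leaves idle.
    """
    if total_slots < 0 or foreground_demand < 0 or background_demand < 0:
        raise ValueError("slot counts and demands must be non-negative")

    # Foreground takes as many slots as it can use, up to the full width.
    fg = min(foreground_demand, total_slots)
    # Background is limited to the leftover slots.
    bg = min(background_demand, total_slots - fg)
    return fg, bg


if __name__ == "__main__":
    # With 8 slots, a foreground demand of 5 leaves 3 for the background.
    print(allocate_slots(8, 5, 6))
```

Under this assumed policy the foreground allocation never depends on the background demand, which is the sense in which the background thread would be "transparent" with respect to slots.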

Similar resources

Integrating Multiple Forms of Multithreaded Execution on SMT Processors: A Quantitative Study with Scientific Workloads

Simultaneous multithreaded (SMT) processors have penetrated the mainstream computing market, since they offer a number of cost / performance advantages over conventional superscalar processors at a nominal additional cost. Simultaneous multithreading can be used in the execution engine of a single monolithic microprocessor, or be embedded and replicated in the execution cores of a chip multipro...

Comparing the Energy Efficiency of CMP and SMT Architectures for Multimedia Workloads

Chip multiprocessing (CMP) and simultaneous multithreading (SMT) are two recently adopted techniques for improving the throughput of general-purpose processors by using multithreading. These techniques are likely to benefit the increasingly important real-time multimedia workloads, which are inherently multithreaded. These workloads, however, often run in an energy constrained environment. This...

Thread-Sensitive Scheduling for SMT Processors

A simultaneous-multithreaded (SMT) processor executes multiple instructions from multiple threads every cycle. As a result, threads on SMT processors – unlike those on traditional shared-memory machines – simultaneously share all low-level hardware resources in a single CPU. Because of this fine-grained resource sharing, SMT threads have the ability to interfere or conflict with each other, as ...

Performance Evaluation of CSMT for VLIW Processors

Clustered VLIW embedded processors have become widespread due to benefits of simple hardware and low power. However, while some applications exhibit large amounts of instruction level parallelism (ILP) and benefit from very wide machines, others have little ILP, which wastes precious resources in wide processors. Simultaneous MultiThreading (SMT) is a well known technique that improves resource...

Optimizing Irregular Adaptive Applications on Multi-threaded Processors: The Case of Medium-Grain Parallel Delaunay Mesh Generation

The importance of parallel mesh generation and the emerging growth of SMT architectures raise an important question of adapting parallel mesh generation software to the SMT architecture. In this work we focus on Parallel Constrained Delaunay Mesh Generation. We explore medium-grain parallelism at the sub-domain level. This parallel approach targets commercially available SMT processors. Our goal is...

Journal:
  • J. Instruction-Level Parallelism

Volume 5

Publication date: 2003